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Abstract 

In terms of making genes expression data more interpretable and comprelnensible, there exists a significant superiority on 
sparse methods. Many sparse methods, such as penalized matrix decomposition (PMD) and sparse principal component 
analysis (SPCA), have been applied to extract plants core genes. Supervised algorithms, especially the support vector 
machine-recursive feature elimination (SVM-RFE) method, always have good performance in gene selection. In this paper, 
we draw into class information via the total scatter matrix and put forward a class-information-based penalized matrix 
decomposition (CIPMD) method to improve the gene identification performance of PMD-based method. Firstly, the total 
scatter matrix is obtained based on different samples of the gene expression data. Secondly, a new data matrix is 
constructed by decomposing the total scatter matrix. Thirdly, the new data matrix is decomposed by PMD to obtain the 
sparse eigensamples. Finally, the core genes are identified according to the nonzero entries in eigensamples. The results on 
simulation data show that CIPMD method can reach higher identification accuracies than the conventional gene 
identification methods. Moreover, the results on real gene expression data demonstrate that CIPMD method can identify 
more core genes closely related to the abiotic stresses than the other methods. 
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introduction 

The changing environmental conditions have a significant 
impact on the survival and growth of plants. A series of various 
abiotic stresses can bring about the overproduction of reactive 
oxygen species in plants, which may damage carbohydrates, 
proteins, lipids and DNA resulting in oxidative stress [1]. In order 
to cope with these abiotic stresses, including cold, drought, heat, 
osmotic press, salt, UV-B light stresses, etc., plants have their own 
defense mechanisms to adapt the complex and changeful 
environment [2,3] . In other words, a particular set of interacting 
plants genes which are always called core genes exist in responding 
to each abiotic stress. Hence, how to extract these core genes is 
becoming a very meaningful issue in plant science. 

With the development of science and technology, the emer- 
gence of gene microarray technology [4,5] makes it possible for 
researchers to monitor gene expression levels on a genomic scale 
[6,7]. This not only brings us more opportunities but also more 
challenges to study the gene expression data. Although the DNA 
microarray technology allows researchers to measure the expres- 
sion levels of thousands (even more than 10,000) of genes in an 



experiment simultaneously, the gene expression data also have the 
problem: the characteristic genes which biologists are interested in 
occupy a very small part of the whole genes. It is difficult for us to 
catch the small but important part of the whole genes due to the 
complexity and multidimensionality of the gene expression data. 
Therefore, it becomes an urgent issue how to identify the 
characteristic genes from gene expression data in an effective way. 

Among a variety of methods, feature selection is demonstrated 
to be a simple and effective method. To obtain the features of gene 
expression data, feature selection methods firstiy calculate a score 
for each feature, then choose the features which gain high scores 
[8]. These methods can achieve a satisfactory performance and 
have a significant superiority on explaining the gene expression 
data more intuitive. But there exists a shortcoming that feature 
selection methods neglect the dependencies among features since 
they only calculate the score for each feature respectively. The 
appearance of feature extraction methods can overcome the 
shortcoming in an effective way [9]. As a tool to reduce the 
dimension, feature extraction methods take all the gene expression 
information simultaneously into consideration to extract the genes 
instead of feature selection methods. Until now, singular value 
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decomposition (SVD) and principal component analysis (PCA) are 
commonly used methods of feature extraction. For example 
Kumar et al. applied SVD on Tuberculosis and Hypertension 
datasets to mine association in health care data [10]. Aradhya et 
al. used SVD for biclustering gene expression data [1 1]. PCA was 
used to cluster gene expression data by Yeung et al. [12]. PCA was 
used to select genes for microarray data analysis by Wang et al. 
[13]. Ma et al. applied PCA for identifying differential gene 
pathways [14]. 

Although SVD and PCA have already been used to analyze the 
gene expression data successfully, they still have some defects. For 
instance, SVD's left singular vectors and right singular vectors are 
always dense. In the same way, this drawback exists in the 
principal components (PCs) of PCA. Thus, it is difficult to explain 
these singular vectors and PCs objectively. Researchers have 
proposed a variety of mathematical methods to reduce the 
complexity of the data and make them more intelligible and 
in terpre table. For example Liu et al. proposed robust PCA for 
discovering difiFerentiaUy expressed genes [15]. Wang et al. used 
non-negative matrix factorization (NMF) on cancer clustering 
[16]. Among these methods, sparse methods have distinct 
advantages and catch the attention of more and more people. 
Until now, a large number of sparse methods were proposed. For 
instance, Wang et al. raised robust sparse PCA (SPCA) by using 
weighted elastic net [17]. A sparse PCA via low-rank approxima- 
tions was proposed by Papailiopoulos et al. [18]. Witten et al. 
proposed a penalized matrix decomposition [19], which was used 
for differential expression analysis [20,21]. In addition, many 
sparse methods have already been chosen to deal with the gene 
expression data. Liu et al. used the first principal component (PC) 
of SPCA for extracting plants core genes [22]. Yin et al. identified 
differential gene pathways with SPCA [23]. Zheng et al. 
discovered molecular pattern [24] based on PMD. 

The sparse methods mentioned above were proverbially applied 
on gene expression data analysis and have made many remarkable 
achievements. But these methods are usually unsupervised while 
the category label of each sample in gene expression data has been 
already known. That is, the class information is neglected by these 
sparse methods when processing gene expression data. For 
example PMD was used to extract plants core genes by Liu et 
Ell. [20]. However, the category labels of samples are quite 
important for gene identification that many excellent gene 
selection algorithms were achieved by using the class information. 
For instance Guyon et al. proposed the Support Vector Machine- 
Recursive Feature Elimination (SVM-RFE) method to select genes 
for cancer classification [25]. SVM-RFE is a classic gene selection 
algorithm that it eliminate genes one by one by using Recursive 
Feature Elimination (RFE) and achieve a very good performance. 
Many extensions on SVM-RFE algorithm were proposed by 
scholars. Tang et al. developed a new two-stage SVM-RFE 
algorithm to gene selection for microarray expression data analysis 
[26]. Ding et al. impr()\'cd the computational performance of 
SVM-RFE by eliminating chunks of features at a time with little 
effect on the quality of reduced feature set [27]. Since SVM-RFE 
was designed to handle the binary feature selection problems, it is 
not suitable for multiclass feature selection problems. In order to 
solve this issue, Zhou et al. proposed a family of four extensions to 
SVM-RFE to solve these problems [28]. Duan et al. computed the 
feature ranking score from a statistical analysis of weight vectors of 
multiple Knear SVMs trained on subsamples of the original 
training data at each step [29]. 

Therefore, we bring in the class information by the total scatter 
matrix and put forward a novel method to improve the 
performance of PMD-based gene extraction method for identify- 



ing plants core genes responding to abiotic stresses. We called it a 
Class-information-based Penalized Matrix Decomposition 
(CIPMD). The scheme of CIPMD is as follows. Firsdy, the total 
scatter matrix is obtained based on samples of the gene expression 
data. Secondly, we decompose the total scatter matrix by using 
SVD, and construct a new data matrix via multiplying the left 
singular vectors by the singular values. Thirdly, the new data 
matrix is decomposed to get the sparse eigensamples by PMD. 
Finally, the core genes are identified according to the nonzero 
entries in eigensamples. 

Our main contributions of this paper are as follows. On one 
hand, it is the first time that it puts forward the CIPMD method 
via integrating the class information into penalized matrix 
decomposition. On the other hand, to identify plants core genes 
responding to abiotic stresses, it provides plenty of experiments on 
simulation and real gene expression data. 

The remainder of the paper is organized as follows. Section 2 
describes the methodology of CIPMD. Section 3 presents the 
numerical experiments and discusses the results. The conclusion is 
shown in Section 4. 

Methodology 

In this section, the class-information-based penalized matrix 
decomposition (CIPMD) method is introduced. Then, it is used to 
identify the core genes responding to the abiotic stresses. 

2.1 The definition of CIPMD 

In this subsection, we take the class information of samples into 
account and propose the CIPMD method to gain a better 
performance than PMD. At first, we bring in the class information 
via the total scatter matrix S(. Then, a new data matrix is 
constructed by decomposing S(. Finally, the new data matrix is 
decomposed by PMD. The following is our specific idea. 

2.1.1 The definition of scatter matrices. There exist many 
samples which contain different class labels in gene expression 
data. We take advantage of the class labels of samples via the total 
scatter matrix. For all the samples of all classes, we define three 
measures from the mathematical point of view. The first measure 
is named as a between-class scatter matrix (St) that is written as 
follows: 

c 

Si,= ^Nj{tij-tJ,){lij-ti) , (1) 
J=i 

where 

c: the number of classes; 
Nj-. the number of samples in class J; 
Hy. the average value of class j; 
H: the average value of all classes. 

The second measure is named as a within-class scatter matrix 
(S^) that is defined by 

S-=EE('<--^7)(^-/^yy' (2) 

/ — 1 / — 1 

where xj- represents the i-th sample of class j. 

The third measure is named as the total scatter matrix (S() 
which is defined based on Sj and S„,. In order to minimize the 
within-class distance and maximize the between-class distance, the 
formula of S, is given as follows: 
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Figure 1. The graphical depiction of CIPMD. In this figure, the matrix F is decomposed into two bases matrices U, V and a diagonal matrix D. 
doi:1 0.1 371/journal.pone.01 06097.g001 



Sj — Si — jjSh,, 



(3) 



where >;>0 represents an adjustable parameter and gives a 
compromise between Sfe and Sh,. 

The between-class and the within-class distances can be 
calculated by the trace of corresponding scatter matrices. In 
detail, the formulas are as follows: 



?race(Si) = trace 

= kh\-\-Xhi-^ ■ ■ ■ + hk. 



L/=i 



(4) 



trace{S„) = trace 



.7 = 1 i=l 
= X-wl + ^wl -|- • • • -|- X-wk 



(5) 



In the two formulas above, the separation of the samples 
between classes can be measured by the trace(Sh) while the 
closeness of the samples within classes can be measured by the 
trace{Syf). The parameter »j in eq. (3) is defined by [30] 



trace(Si,) 
trace{S„) 



(6) 
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Figure 2. Accuracies of the CiPIVID on simulation data set with different values of m. 
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2.1.2 Constructing the new data matrix F. Due to the 

total scatter matrix S; should be processed by PMD in a 
convenient way, so S( is preprocessed by matrix decomposition 
methods. 

Firstly, the total scatter matrix S( is decomposed by SVD, which 
can be written as follows: 



S(=WAff, 



(7) 



where W and H are orthogonal matrices, A = diag{Si,S2, ■ ■ ■ ,S„) 
is the diagonal matrix which contains singular vEilues, n is the rank 
of the total scatter matrix S(. 

Secondly, a new data matrix F is constructed by 



F = WA'' 



(8) 



where m is the power of A. The suitable value of m can be 
determined in Subsection 3.1.2 by using the simulation data. 
Finally, the new data matrix F is decomposed by PMD. 



In this way, the total scatter matrix S( which contains large 
amounts of complex data is converted to the new data matrix F 
which is simple and easy to be processed. 

2.1.3 Penalized matrix decomposition (PMD), In this 

subsection, we briefly introduce the PMD method proposed by 
Witten et al. [19]. Gene expression data always consist of^ genes 
in n samples, in general, p> >n. According to subsection 2.1.1 
and subsection 2.1.2, the new data matrix F is obtained by 
calculating the original gene expression data. Therefore, we 
denote the gene expression data by the matrix F with size p xn. 
Without loss of generality, we let the row mean of F be zero. The 
matrix F can be decomposed by SVD as follows: 



F = UDV^ 



(9) 



where U is a ^ x r orthogonal matrix, V is an « X r orthogonal 
matrix and D is a diagonal matrix. PMD can generalize this 
decomposition by imposing constraints on U and/ or V. PMD can 
be represented as the following optimization problem: 
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Figure 4. Accuracies of these metKiods on simulation data with different paramaters in the case of multi-class. 
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When r=l, u and V satLsfying eq.(lO) can also satisfy the 
optimization problem as follows [19]: 



(10) 



s.t. ||u||2 = l, l|v||2 = l, Pi(u)<ai, P2(v)<o:2, d>0. 



maxu Fv 

u,v 



where 

u^: the column k of U; 

v/c ■ the column kofV; 

dj^: the k-lh diagonal element of D; 

11 •11^: the Frobenius norm; 

Pi and Pj' convex penalty functions that can adopt a various of satisfies eq.(lO) [19] 
forms [19]. 



s.t. Ml<\, ||v||^<l,Pi(u)<o:i,P2(v)<o!2, 



(11) 



and the d satisfying eq.(lO) is fif<-u^Fv. The objection function 
u^Fv in eq.(l 1) is bilinear on u and v, that is to say, when u fixed, it 
is linear in v, and vice versa. By choosing the appropriate a.\ and 
0.2, the solution to eq.(l 1) which is named as rank-one PMD 



Table 1. The number of each stress types in the raw data. 
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Figure 5. Response to stimulus (GO:0050896) in shoot samples. 
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The iterative algorithm for rank-one PMD is summarized as 
follows: 

Stepl. Initialize V to have unit L2-norm. 
Step2. Iterate until convergence: 

(a) u<- argmax„u''Fv,.5.?.[|u[|2< l,/'i(u)<o!i. 

(b) ¥•<- argmaXyU^Fv,5.?.||v||2< l,/'2(v)<a2- 
Step3. d<-a^¥y. 

In order to obtain the rank-r PMD, each time we use the 
residuals obtained by subtracting duy^ from F to maximize the 
eq.(ll) repeatedly, i.e., F*+'<-F* — t/fcUfcvl". The specific algo- 
rithm of rank-r PMD can be found in [19]. In this research, we 
only impose the penalty on u, i.e. Pi{u) <ai, and do not consider 
V since core genes are identified according to u. PMD can produce 
sparse vectors u by choosing a suitable parameters ai. 

2.2 Identifying core genes by CIPMD 

The gene expression data are stocked as the matrix F with size 
pxn, in which each row of F represents the transcriptional 
responses of a gene in all w samples and each column of F 
represents the expression level of a sample in all p genes. 

According to subsection 2.1.3, the matrix F is decomposed into 
three matrices U, V and D by PMD. The graphical depiction of 
CIPMD is shown in Figure 1. Following the convention in [31], 
we define {v^} (columns of V) as eigenpatterns, {u^} (columns of 
U) as eigensamples and {r,} (rows of U) as eigengenes. As Figure 1 
shows, the space of sample expression profiles S/ (a column of F) is 



spanned by U and the space of gene transcriptional responses r, (a 
row of F) is spanned by V. 

Our goal is to identify the core genes from the gene expression 
data. Generally speaking, due to the complexity of F, it is difficult 
to identify the core genes from F directiy. So we must take 
measures to reduce the dimensionality of the gene expression data. 
As mentioned above, the space of sample expression profiles Sy is 
spanned by U and is a column of U, so we can select a subset of 
Ui to represent F. Then the eigengenes are identified from the 
eigensamples which have the features of gene expression data. 
These eigengenes are regarded as core genes responding to the 
abiotic strc-sscs. The detail of how to identify the core genes from 
the sample expression profiles is shown in the following. 

Firstiy, the number of variables used to denote the sample 
expression profiles can be reduced by CIPMD. According to 

r 

eq.(9), Sj can be formulated as yXkdkVjk, j= 1,2, where Vjk 

is the y'-th element in v^- It shows that Sj is a Unear combination of 
Ui. In Figure 1, Sy is the J-th column of V^, which includes the 
positional information of the j-th sample. By using Sy, the 
expression profiles of samples can be acquired by r variables. 
However, the number of variables in sample expression profiles Sy 
is p which is much larger than r. Therefore, the number of 
variables used to denote the sample expression profiles can 
generally be reduced by CIPMD. 
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Figure 6. Response to stimulus (GO:0050896) in root samples. 
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Secondly, since the eigensamples Uk are used to reconstruct F, 
the sample expression profiles S/ which contain the important 
information can be represented by the eigensamples Ut. 

Thirdly, the sparse can be obtained by choosing the penalty 
function appropriately. According to the subsection 2.1.3, we can 
take penalty function /'i(u)<ai. By choosing a suitable param- 
eters ai, the sparse can be obtained. 

Finally, the core genes responding to abiotic stresses are 
identified via the sparse u^. The features of samples in gene 
expression data can be represented by the nonzero entries in the 
sparse u^. Therefore, the nonzero entries can be denoted as the 
core genes responding to abiotic stresses. 

The whole scheme to identify the core genes can be summarized 
in the following: 

Firstly, the total scatter matrix S( is obtained bases on the gene 
expression data X. 

Secondly, St is decomposed into left singular vectors W, right 
singular vectors H and a diagonal matrix A by using SVD, and a 
new data matrix F is constructed by multiplying W by A. 

Thirdly, PMD decomposes the data matrix F to obtain the 
sparse eigensamples Uk. 

Fourthly, the genes corresponding to nonzero entries in Ui are 
identified as the core ones. 

Finally, the core genes are checked by using Gene Ontology 
(GO) tool. 



Results and Discussion 

In this section, we evaluate the CIPMD method by applying it 
to identify the core genes responding to abiotic stresses. Subsection 
3.1 and 3.2 provide the results on simulation and real gene 
expression data sets, respectively. For comparison, the sparse 
principal component analysis (SPCA) [32], penalized matrix 
decomposition (PMD) [19] and support vector machine-recursive 
feature elimination (SVM-RFE) [25] methods are used to identify 
the features on simulation and real gene expression data sets. The 
LIBSVM that Chang et al. proposed [33] is used to implement 
SVM-RFE algorithm. 

3.1 Results on simulation data 

In this subsection, the simulation data are firstly introduced. 
Then, the parameters of SPCA, PMD and CIPMD are chosen 
appropriately. Since SVM-RFE method eliminate genes one by 
one by using Recursive Feature Elimination (RFE) and have no 
control-sparsity parameters, so we do not consider it in this 
subsection. Finally, the results on simulation data are shown. 

3.1.1 Data source. The simulation data are generated with 
p = 20000 genes (roughly equal to the number of genes in real gene 
expression data) and m=16 samples. In the two-class case, we 
assign 8 samples and /> = 20000genes for each class. In the multi- 
class case, the 16 samples are divided equally into 4 classes. 

The simulation data are in RP with p = 20000 and generated as 
X~(0,^4). Let Vi~V4 be four 20000-dimensional vectors, such 
that vu = l, A:=l,---,125, and vit = 0, A;= 126, • • • , 20000; 
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Figure 7. Response to stress (GO:0006950) in slioot samples. 
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V2,t = l, ^=126,---,250, and V2/t = 0, A:?^ 126, • • • ,250; V3/t = 
1, A: = 251,---,375, and Vu = 0, A:#251, • ■ ■ ,375; y^i^ = \,k = 
376, •••,500, and V4i = 0, ^ # 376, • • • ,500. Let E be a 20000- 
dimensional noise matrix, and E~A^(0,1). Then we add E into v 
with difTerent Signal-to-Noise Ratios (SNR). The preceding four 
eigenvectors of ^4 are normalized to be V/t = V/t/||vA-||, fc = 
1,2,3,4. And in order to make the first four eigenvectors dominate, 
we let the eigenvalues be ci =400, C2 = 300, C3 =200, C4 = 100 and 
C(t = 1 for A: = 5, • • • ,20000. In this way, the simulation idea in [34] 
is applied to generate the simulation data. 

3.1.2 Parameters selection.. In this subsection, the param- 
eter m in eq.(8) is determined by the simulation experiment. Then 
the control-sparsity parameters of the three methods are selected 
appropriately. 

(i) The determinatioii of parameter m: For CIPMD, we 
need to determine the appropriate parameter m in eq.(8) to 
make our method optimal. We randomly generate the 
simulation data by iterating 1 00 times to test the performance 
of CIPMD with different values of in. Figure 2 displays the 
performance of CIPMD with m varying from 0.5 to 5. From 
this figure it can be seen that all the values of w can get very 
high identification accuracies. The best result is achieved 
when m=l, so we take m=l for CIPMD in the following 
experiments. 

(ii) The selection of control-sparsity parameters: Except 
for SVM-RFE, all the other three methods are sparse, whose 



control-sparsity parameters have a great influence on 
identifi[:ation accuracy. The SPCA proposed by Journee et 
Ell. has an excellent performance both in computational speed 
and quaUty [32]. The parameter y in SPCA is used to adjust 
the sparsity of PCs. According to the algorithm of CIPMD, 
the /i-norm of U is taken as the penalty function, i.e. 
||u||i<ai. Since \<ai<^, let ai=a* where 
1 / ^/p < a < 1 . So we can obtain a sparse u by choosing an 
appropriate ai . For simplicity, only one factor is used, that is, 
letfc=l. 

For fair comparison, 500 genes are identified by using these 
methods with their own appropriate parameters. And the Signal- 
to-Noise Ratio (SNR) is set to be 0. 1 when the simulation data are 
generated. 

3.1.3 Simulation results. We randomly generate the simu- 
lation data by iterating 100 times to evaluate the performances of 
the four methods. The specific numerical values of identification 
accuracies of the four methods with different parameters are 
shown in supplementary file (Table SI). For the two-class case, the 
graphical depiction of the identification accuracies of these 
methods with different parameters is shown in Figure 3. From 
this figure, it can be seen that except for SVM-RFE, all the other 
three methods are sensitive to the control-sparsity parameters. The 
identification accuracies of SPCA are monotonically decreasing 
with the control-sparsity parameter when its value is greater than 
0.15. On the contrary, the identification accuracies of PMD and 
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Figure 8. Response to stress (GO:0006950] in root samples. 

doi:10.1371/journal.pone.0106097.g008 



CIPMD are monotonically increasing with the parameters when 
their values are smaller than 0.25. The identification accuracies of 
PMD and CIPMD are stabilized when the parameters are greater 
than 0.25. Moreover, all the four methods can obtain very high 
identification accuracies. Finally, our CIPMD has the highest 
accuracies among the four methods. 

For the multi-class case, the graphical depiction of the 
identification accuracies of these methods with different param- 
eters is shown in Figure 4. Since SVM-RFE was designed to deal 
with the binary gene selection problem, accordingly, it is not 
included in this part. From diis figure, we can see that the 
identification accuracies of all the three methods can reach higher 
values. Similar to the two-class case, the identification accuracies 
of SPCA are monotonically decreasing with the increasing of 
control-sparsity parameter. When the parameters are greater than 
0.2, the identification accuracies of CIPMD can reach the highest 
point and becomes stable. While the parameters are greater than 



0.25, PMD reaches a plateau in terms of identification accuracy. 
Among the three methods, only the identification accuracies of 
CIPMD can reach more than 90%. Furthermore, except for the 
parameter is 0.1, CIPMD outperforms the other methods on 
identification accuracies with all parameters. 

3.2 Results on gene expression data 

The real gene expression data are introduced in subsection 
3.2.1. Then, the gene ontology (GO) analysis is adopted to 
evaluate the performances of the four methods. 

3.2.1 Data source. The raw gene expression data include 
two classes: shoots and roots in each stress. The Affymetrx CEL 
files are downloaded from NASCArrays [http://aflfy.arabidopsis. 
info/] [35], reference numbers are: control, NASCArrays- 137 
cold stress, NASCArrays- 1 38; osmotic stress, NASCArrays- 1 39 
salt stress, NASCArrays- 1 40; drought stress, NASCArrays- 1 4 1 
UV-B light stress, NASCArrays- 144; and heat stress, NASCAr- 



Table 6. The numbers of response to water deprivation (GO:0009415) 


in root samples. 








stress Type SPCA PMD 


CIPMD 


SVM-RFE 




SF PV SF PV 


SF PV 


SF 


PV 


Drought 44 8.8% 1.73E-19 5110.2% 6.21E-26 


69 1 3.8% 7.8E-45 


45 9.0% 


2.53E-20 



doi:l 0.1 371 /journal.pone.Ol 06097.t006 
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Table 7. References about core genes responding to water deprivation in root samples. 



Gene name 


Response to 


References 


At2g33380 


Drought, cold 


Heyndrickx et al. (2012) [40] 


At4g34390 


Drought 


Heyndrickx et al. (2012) [40] 


At5g62470 


Drought 


Seo et al. (2009) [41] 


At3g 14050 


Drought 


Heyndrickx et al. (2012) [40] 


At3g 11820 


Drought, cold 


Heyndrickx et al. (2012) [40] 


At3g 19970 


Drought 


Heyndrickx et al. (2012) [40] 


At5g54490 


Drought 


Heyndrickx et al. (2012) [40] 


At5g27420 


Drought 


Heyndrickx et al. (2012) [40] 


At4g24960 


Drought, cold 


Chen et al. (2002) [42] 


At2g30550 


Drought 


Heyndrickx et al. (2012) [40] 


At3g30775 


Drought 


Sharma et al. (2011) [43] 


At3g63060 


Drought, salt, osmotic 


Koops et al. (2011) [44] 


At3gC9940 


Drought 


Vadassery et al. (2009) [45] 


At4g21440 


Drought 


Heyndrickx et al. (2012) [40] 


At1g73480 


Drought, cold 


Heyndrickx et al. (2012) [40] 


At5g67340 


Drought, cold 


Heyndrickx et al. (2012) [40] 


At4g17500 


Drought 


Heyndrickx et al. (2012) [40] 


At2g 17840 


Drought, cold 


Kiyosue et al. (1994) [46] 


At3g52400 


Drought, cold 


Fujita et al. (2004) [47] 


At4g05100 


Drought 


Heyndrickx et al. (2012) [40] 


At5g24590 


Drought 


Heyndrickx et al. (2012) [40] 


At5g67300 


Drought 


Huang et al. (2008) [48] 


At5g40390 


Drought 


Maruyama et al. (2009) [49] 


At3g 19580 


Drought 


Sakamoto et al. (2000) [50] 


At5g45340 


Drought 


Umezawa et al. (2006) [51] 


At1g22190 


Drought, cold, osmotic 


Rea et al. (2011) [52] 


At3g57530 


Drought, cold 


Heyndrickx et al. (2012) [40] 



doi:1 0.1 371 /journal.pone.Ol 06097.t007 



rays- 146. The number of samples in each stress type is listed in 
Table 1. There are 22810 genes in each sample. The arrays are 
adjusted by using the GC-RMA software by Wu et al. [36] to 
avoid the background of optional noise and normalized by using 
quantUe normalization. The GC-RMA results are gathered in a 
matrix to be processed by SPG A, PMD, SVM-RFE and GIPMD. 

Our method brings in the class information of samples based on 
the total scatter matrix. Therefore, in our experiments, two stress 
types of gene expression data are processed simultaneously. 

3.2.2 Gene Ontology (GO) analysis. Gene Ontology (GO) 
Term Enrichment tools can be used to describe genes in the input 
or query set and to help discover what functions the genes may 
have in common [37]. As a web-based tool, GOTermFinder can 
find the significant GO terms among a list of genes. Therefore, it 
offers some significant informations for the biological explanation 



of high-throughput experiments. The core genes responding to 
abiotic stresses identified by SPCA, PMD, SVM-RFE and 
CIPMD are checked by GOTermFinder which is publicly 
available at http://go.princeton.edu/cgi-bin/GOTermFinder 
[38]. Its threshold parameters are set as following: maximum P- 
value =0.01 and minimum number of gene products = 2. Here, 
only the main results of GO Term Enrichment are shown. 

(i) Terms responding to stimulus: The numbers of genes 
responding to stimulus (GO:0050896), which is the ancestor 
of all the abiotic stresses, are identified by the four methods in 
shoot and root samples are listed in Table 2 and Table 3, 
respectively. The superior results are marked in bold type. 
From the two tables we can see that all these methods can 



Table 8. The nunnbers of response to heat (GO:0009408) in shoot samples. 



stress Type ^PCA PMD CIPMD SVM-RFE 



SF PV SF PV SF PV SF PV 



Heat 41 8.2% 1.13E-22 77 15.4% 2.37E-66 97 19.4% 9.47E-96 None None 



doi:1 0.1 371 /journal.pone.Ol 06097.t008 
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Table 9. References about core genes responding to heat in shoot samples. 



Gene name 


Response to 


References 


At2g43630 


Heat 


Heyndrickx et al. (2012) [40] 


Atlg 14360 


Heat 


Heyndrickx et al. (2012) [40] 


At5g28540 


Heat 


Koizumi et al. (1996) [53] 


At2g20940 


Heat 


Heyndrickx et al. (2012) [40] 


At4g29520 


Heat 


Heyndrickx et al. (2012) [40] 


At4g29330 


Heat 


Heyndrickx et al. (2012) [40] 


At5g22060 


Heat 


Heyndrickx et al. (2012) [40] 


Atlg04980 


Heat 


Heyndrickx et al. (2012) [40] 


At4g00940 


Heat 


Heyndrickx et al. (2012) [40] 


At3gl0800 


Heat 


Gao et al. (2008) [54] 


At4g 16660 


Heat 


Heyndrickx et al. (2012) [40] 


Atlg07410 


Heat 


Heyndrickx et al. (2012) [40] 


At5g56030 


Heat 


Takahashi et al (1992) [55] 


At2g02810 


Heat 


Heyndrickx et al. (2012) [40] 



doi:1 0.1 371 /journal.pone.Ol 06097.t009 



identify genes with very high .sample frequency and very low 
P-value. 

As Table 2 listed, in shoot samples, only in UV-B light stress 
data set, CIPMD method is dominated by PMD. For other stresses 
data sets, CIPMD outperforms the SPCA, PMD and SVM-RFE. 
As Table 3 listed, in root samples, CIPMD performs better than 
the other three methods in aU the stresses data sets except the salt 
stress. In salt stress data set, PMD method is superior to our 
method. 

The sample frequencies of the six different stresses response to 
stimulus in shoot and root samples are shown in Figure 4 and 
Figure 5, respectively. 

From Figure 5, it can be seen that PMD has a higher data point 
on UV-B hght stress data set than SPCA, SVM-RFE and CIPMD. 
However, CIPMD method is superior to PMD, SPCA and SVM- 
RFE in the remaining five stresses data sets of shoot samples. 
Figure 6 shows that only in salt stress data set, CIPMD has a lower 
data point than PMD. CIPMD method outperforms the other 
three methods in a large degree (especially in heat stress data set) in 
other five stresses data sets of root samples. From the two figures 
we can also find that SVM-RFE and CIPMD give more stable 
results in six different stresses data sets than PMD and SPCA 
methods whose results fluctuate up and down in greatly 
amplitudes. 



PMD outperforms the proposed method in some case of the 
experiment, e.g. the UV-B light stress data set in shoot samples 
and the salt stress data set in root samples, the most likely reason is 
that the different distributions of data lead to the different 
performances between methods. This problem also exists in 
elsewhere, for example Zheng et al. proposed a gene selection 
method based on Robust Principal Component Analysis (RPCA) 
to select plants characteristic genes, in their experiments, the 
number of genes responding to abiotic stimulus (GO:0009628) is 
selected by three methods in root samples, the performance of 
RPCA is equal to PMD only in UV-B stress data set, in other data 
sets, RPCA method is superior to the others [39]. 

(ii) Terms responding to stress: Table 4 and Table 5 list 
the gene numbers and P-value of response to stress 
(00:0006950) in shoot and root samples, respectively. The 
superior results are marked in bold type. 

As Table 4 listed, in shoot samples, CIPMD is superior to tiie 
other three methods in all the data sets except UV-B hght stress. 
PMD suppresses our method only in the UV-B light stress data set. 
As Table 5 listed, in root samples, CIPMD is dominated by PMD 
only in salt-stress data set. CIPMD outperforms our competitive 
methods in other five stresses data sets. 

The sample frequencies of response to stress in shoot and root 
samples are shown in Figure 7 and Figure 8, respectively. 



Table 10. Reference about core genes involved 


In heat acclimation in shoot samples. 




Gene name 


Response to 


References 


At5g38895 


Heat 


Heyndrickx et al. (2012) [40] 


Atlg77000 


Heat 


Lim et al (2006) [56] 


At4g02550 


Heat 


Heyndrickx et al. (2012) [40] 


At3g50970 


Heat, drought 


Heyndrickx et al. (2012) [40] 


Atlg 13080 


Heat 


Lim et al (2006) [56] 


At4gll220 


Heat 


Heyndrickx et al. (2012) [40] 


doi:1 0.1 371 /journal.pone.Ol 06097.t01 0 
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Figure 7 shows that CIPMD method owns a lower data point in 
UV-B light stress data set than PMD. But in the rest five stresses 
data sets of shoot samples, our CIPMD is superior to the other 
methods. From Figure 8, it can be proved that PMD has a data 
point performing better than CIPMD only in salt stress data set. 
CIPMD method surpasses PMD and SPCA in a large extent 
(especially in heat stress data set) in other data sets of root samples. 
CIPMD outperforms SVM-RFE in all the data sets with six 
different stresses. Besides, both in shoot and root samples, CIPMD 
and SVM-RFE present more stable results than PMD and SPCA 
in six different stresses data sets. 

(iii) Core genes responding to the stresses: The data of the 
drought stress in root samples and heat stress in shoot 
samples are analyzed to evaluate the core genes identified by 
our method closely related to the stresses. 

For drought stress in root samples. Table 6 gives the sample 
frequency and P-value of response to water deprivation (GO: 
0009415). The background sample frequency of response to water 
deprivation (GO: 0009415) in root samples is 1.4% (421/30324). 
As Table 6 listed, the superior results of the three methods are 
shown in bold type. Obviously, CIPMD can identify more genes 
than the other three methods. 

Moreover, we compare the genes identified by CIPMD with the 
ones identified by PMD, SPCA and SVM-RFE to verify the core 
genes extracted by our method closely related to abiotic stresses. 
Table 7 lists different genes identified by CIPMD and ignored by 
other three methods in the first column. The column of Response 
to represents what stresses the genes response to, and the column 
of Reference denotes the searching results that the authors have 
already confirmed in their literatures. As Table 7 listed, all the 27 
genes selected by CIPMD and neglected by PMD, SPCA and 
SVM-RFE can be searched in literatures. And all these core genes 
are indeed closely related to drought stress. Furthermore, some of 
the genes are also related to cold, osmotic and salt stresses. 

For heat stress in shoot samples. Table 8 lists the sample 
frequency and P-value of response to heat (GO: 0009480). The 
background sample frequency of response to heat (GO: 0009480) 
in shoot samples is 1.0% (298/30324). In Table 8, the superior 
results of the four methods are marked in bold type. Wherein 
SVM-RFE cannot identify effective genes response to heat. It can 
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be seen clearly that CIPMD method can identify more genes than 
the other methods. 

In detail, we compare the genes identified by CIPMD with the 
ones identified by using PMD, SPCA and SVM-RFE. There are 
20 different core genes identified by our method and neglected by 
PMD, SPCA and SVM-RFE. Among these 20 genes, 14 genes 
responding to heat have been confirmed in literatures. We show 
the verified results of the 14 genes in Table 9. The remaining 
genes of the 20 genes are involved in heat acclimation (GO: 
0010286) which is the children of response to heat (GO: 0009408). 
The affirmed results in literatures of the 6 genes are listed in 
Table 10. From the verifications, it is obvious that all the 20 genes 
identified by CIPMD and ignored by PMD, SPCA and SVM- 
RFE are closely related with heat stress. 

Conclusion 

In this study, we proposed a novel Class-information-based 
Penalized Matrix Decomposition method for identifying core 
genes. Our method can achieve a better identification capacity by 
bringing in the class information of samples based on the total 
scatter matrix. By integrating matrix decomposition and the PMD 
method, our method is appropriate to analyze the gene expression 
data. A large number of experiments on simulation and real gene 
expression data demonstrate that our CIPMD method outper- 
forms both PMD, SPCA and SVM-RFE. Thus, our approach is 
effective to identify plants core genes responding to abiotic stresses. 

In the future, we will focus on the biological interpretation of the 
core genes. 
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